
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains a set of clinical features that can be used to predict possible heart disease.
People with cardiovascular disease, or who are at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already-established disease), need early detection and management, and a machine learning model can be of great help here.
This notebook covers:
- Dataset exploration using various data analysis and visualization techniques, with results saved to Excel files.
- Feature engineering to improve the data, select the best features, and more.
- Machine learning models that can predict a patient's disease status.
- Predictions on new example data, with the prediction results exported.
- Saving the best model for later deployment.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Loading, Preprocessing, Analysis Libraries
import numpy as np
import pandas as pd
# Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
%matplotlib inline
# Model Training And Testing libraries
from sklearn.model_selection import train_test_split
# Model Algorithms Libraries
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
# Pre-Processing Libraries
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
# Metrics & Hyper Parameter Libraries
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, confusion_matrix, recall_score, accuracy_score, precision_score, f1_score, classification_report, roc_curve
from sklearn.model_selection import cross_val_score, GridSearchCV, RepeatedStratifiedKFold, KFold, StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
# Best Features Selection For Each Category Libraries
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import chi2
# Profiling Libraries
from ydata_profiling import ProfileReport
import os
This dataset was created manually to predict whether a patient has heart disease. It contains cardiac information on patients along with the diagnosis of whether each patient has heart disease.
Machine learning models can help determine whether a patient has heart disease and speed up the diagnostic process based on the medical information provided about that patient. The variables that most influence whether a patient has heart disease are also explored in depth in this notebook.
# loading the csv data to a Pandas DataFrame
heart_data = pd.read_csv(r"C:\Users\acer\Downloads\IDS Project\Dataset\Heart.csv")
profile = ProfileReport(heart_data, title="Heart Disease Report", explorative=True)
profile
# Save the report to the specified path
profile.to_file(r"C:\Users\acer\Downloads\IDS Project\Dataset_Report.html")
| Attribute | Description | Emoji |
|---|---|---|
| Age | Age of the patient [years] | 👵👴 |
| Sex | Sex of the patient [M: Male, F: Female] | 🚹🚺 |
| ChestPainType | Chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic] | ❤️🩹 |
| RestingBP | Resting blood pressure [mm Hg] | 💉 |
| Cholesterol | Serum cholesterol [mg/dl] | 🩸 |
| FastingBS | Fasting blood sugar [1: FastingBS > 120 mg/dl, 0: otherwise] | 🧁 |
| RestingECG | Resting electrocardiogram results [Normal: normal, ST: ST-T wave abnormality, LVH: probable or definite left ventricular hypertrophy by Estes' criteria] | 🩺 |
| MaxHR | Maximum heart rate achieved [numeric value between 60 and 202] | 💓 |
| Exercise Angina | Exercise-induced angina [Y: Yes, N: No] | 🏃♂️🚫 |
| Oldpeak | ST depression induced by exercise relative to rest [numeric value] | 📉 |
| ST_Slope | Slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping] | 📈 |
| VCF | Number of major blood vessels [values: 0-3] | 🔢 |
| Smoking | Smoking status [1: smoker, 0: non-smoker] | 🚬🚭 |
| Creatine | Level of the CPK enzyme in the blood [mcg/L (0-2500)] | 🧪 |
| Thal | Thalassemia status [values 0-3] | 🧬 |
| HeartDisease | Output class [1: has disease, 0: no disease] | ❤️🩹❤️ |

The outcome of this phase is given below:
# print first 5 rows of the dataset
heart_data.head()
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | Exercise Agina | Oldpeak | ST_Slope | VCF | Smoking | Creatine | Thal | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 2 | 0 | 168 | 3 | Y |
| 1 | 49 | F | NAP | 160 | 180 | 1 | Normal | 156 | N | 1.0 | Flat | 0 | 0 | 155 | 3 | Y |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 | 1 | 125 | 3 | Y |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 | 0 | 161 | 3 | Y |
| 4 | 54 | M | NAP | 150 | 195 | 1 | Normal | 122 | N | 0.0 | Up | 3 | 0 | 106 | 2 | Y |
# print last 5 rows of the dataset
heart_data.tail()
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | Exercise Agina | Oldpeak | ST_Slope | VCF | Smoking | Creatine | Thal | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1395 | 53 | F | ASY | 130 | 130 | 1 | ST | 120 | Y | 2.0 | Flat | 3 | 1 | 120 | 1 | Y |
| 1396 | 38 | M | ASY | 138 | 138 | 0 | LVH | 139 | Y | 2.5 | Up | 3 | 1 | 139 | 1 | Y |
| 1397 | 53 | F | ATA | 117 | 117 | 0 | Normal | 108 | Y | 2.0 | Flat | 0 | 1 | 108 | 1 | Y |
| 1398 | 62 | M | ATA | 121 | 121 | 0 | Normal | 148 | Y | 2.5 | Up | 2 | 1 | 148 | 1 | Y |
| 1399 | 50 | M | TA | 193 | 179 | 1 | LVH | 92 | N | 0.4 | Flat | 0 | 1 | 92 | 1 | N |
print("The shape of the dataset is : ")
heart_data.shape
The shape of the dataset is :
(1400, 16)
# getting some info about the data
heart_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             1400 non-null   int64
 1   Sex             1400 non-null   object
 2   ChestPainType   1400 non-null   object
 3   RestingBP       1400 non-null   int64
 4   Cholesterol     1400 non-null   int64
 5   FastingBS       1400 non-null   int64
 6   RestingECG      1400 non-null   object
 7   MaxHR           1400 non-null   int64
 8   Exercise Agina  1400 non-null   object
 9   Oldpeak         1400 non-null   float64
 10  ST_Slope        1400 non-null   object
 11  VCF             1400 non-null   int64
 12  Smoking         1400 non-null   int64
 13  Creatine        1400 non-null   int64
 14  Thal            1400 non-null   int64
 15  HeartDisease    1400 non-null   object
dtypes: float64(1), int64(9), object(6)
memory usage: 175.1+ KB
def percent_counts(df, feature):
    total = df[feature].value_counts(dropna=False)
    percent = round(df[feature].value_counts(dropna=False, normalize=True) * 100, 2)
    percent_count = pd.concat([total, percent], keys=['Total', 'Percentage'], axis=1)
    return percent_count
# Path to save the file
output_path = r"C:\Users\acer\Downloads\IDS Project\Features_counts.xlsx"
# Create a Pandas Excel writer using XlsxWriter as the engine.
with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
    for feature in heart_data.columns:
        df_counts = percent_counts(heart_data, feature)
        # Convert the DataFrame to an XlsxWriter Excel object.
        df_counts.to_excel(writer, sheet_name=feature)
print(f"Report saved to {output_path}")
Report saved to C:\Users\acer\Downloads\IDS Project\Features_counts.xlsx
def save_descriptive_statistics_excel(df, numeric_columns, output_path):
    # Create a Pandas Excel writer using XlsxWriter as the engine.
    with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
        for col in numeric_columns:
            stats = df[[col]].describe()
            stats.to_excel(writer, sheet_name=col)
    print(f"Descriptive statistics saved to {output_path}")
# List of numerical columns
num = heart_data.select_dtypes(include=['float64', 'int64'])
num
| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | VCF | Smoking | Creatine | Thal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 140 | 289.0 | 0 | 172 | 0.0 | 2.0 | 0 | 168 | 3 |
| 1 | 49 | 160 | 180.0 | 0 | 156 | 1.0 | 0.0 | 0 | 155 | 3 |
| 2 | 37 | 130 | 283.0 | 0 | 98 | 0.0 | 0.0 | 1 | 125 | 3 |
| 3 | 48 | 138 | 214.0 | 0 | 108 | 1.5 | 1.0 | 0 | 161 | 3 |
| 4 | 54 | 150 | 195.0 | 0 | 122 | 0.0 | 2.5 | 0 | 106 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1395 | 53 | 130 | 130.0 | 0 | 120 | 2.0 | 2.5 | 1 | 120 | 1 |
| 1396 | 38 | 138 | 138.0 | 0 | 139 | 2.5 | 2.5 | 1 | 139 | 1 |
| 1397 | 53 | 117 | 117.0 | 0 | 108 | 2.0 | 0.0 | 1 | 108 | 1 |
| 1398 | 62 | 121 | 121.0 | 0 | 148 | 2.5 | 2.0 | 1 | 148 | 1 |
| 1399 | 50 | 170 | 179.0 | 0 | 92 | 0.4 | 0.0 | 1 | 92 | 1 |
1400 rows × 10 columns
# List of numerical columns
num = heart_data.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Path to save the file
output_path = r"C:\Users\acer\Downloads\IDS Project\Numerical_stats.xlsx"
# Call the function to save descriptive statistics
save_descriptive_statistics_excel(heart_data, num, output_path)
Descriptive statistics saved to C:\Users\acer\Downloads\IDS Project\Numerical_stats.xlsx
heart_data.select_dtypes(include=['object']).describe()
| | Sex | ChestPainType | RestingECG | Exercise Agina | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|
| count | 1400 | 1400 | 1400 | 1400 | 1400 | 1400 |
| unique | 2 | 4 | 3 | 2 | 3 | 2 |
| top | M | ASY | Normal | N | Flat | Y |
| freq | 856 | 760 | 780 | 815 | 709 | 700 |
percent_counts(heart_data, "HeartDisease")
| | Total | Percentage |
|---|---|---|
| Y | 709.0 | 50.64 |
| N | 684.0 | 48.86 |
| NaN | 7.0 | 0.50 |
def get_null_indices(df, feature):
    null_indices = df[df[feature].isnull()].index
    return null_indices
null_indices = get_null_indices(heart_data, "HeartDisease")
null_indices
Int64Index([1385, 1386, 1387, 1388, 1389, 1390, 1391], dtype='int64')
percent_counts(heart_data, "HeartDisease")
| | Total | Percentage |
|---|---|---|
| Y | 700.0 | 50.0 |
| N | 700.0 | 50.0 |
| NaN | 0.0 | 0.0 |
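The counts above show the seven missing HeartDisease labels were resolved between the two calls, although the handling cell itself is not shown. A minimal sketch of one common approach — dropping the rows with a missing target — on a hypothetical mini-frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for heart_data
toy = pd.DataFrame({'Age': [40, 49, 37, 48],
                    'HeartDisease': ['Y', 'N', np.nan, 'Y']})

# Drop rows whose target label is missing, then re-index
toy = toy.dropna(subset=['HeartDisease']).reset_index(drop=True)
print(toy['HeartDisease'].tolist())  # → ['Y', 'N', 'Y']
```

After this step a fresh `percent_counts` call would report no NaN category.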
continuous_values = []
categorical_values = []
for column in heart_data.columns:
    if heart_data[column].dtype == 'int64' or heart_data[column].dtype == 'float64':
        continuous_values.append(column)
    else:
        categorical_values.append(column)
categorical_values
['Sex', 'ChestPainType', 'RestingECG', 'Exercise Agina', 'ST_Slope', 'HeartDisease']
continuous_values
['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak', 'VCF', 'Smoking', 'Creatine', 'Thal']
# checking for missing values
print("Missing values: \n")
heart_data.isnull().sum()
Missing values:
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
Exercise Agina    0
Oldpeak           0
ST_Slope          0
VCF               0
Smoking           0
Creatine          0
Thal              0
HeartDisease      0
dtype: int64
### Good, we did not find any null values in the dataset
data_dup = heart_data.duplicated().any()
print(data_dup)
heart_data = heart_data.drop_duplicates()
print("After removing duplicates: \n")
heart_data.shape
#### Since the result is False, the dataset does not contain any duplicate rows.
False
After removing duplicates:
(1400, 16)
# Count unique values per column (avoid shadowing the built-in `dict`)
unique_counts = {}
for col in heart_data.columns:
    unique_counts[col] = heart_data[col].nunique()
pd.DataFrame(unique_counts, index=["unique count"]).transpose()
| | unique count |
|---|---|
| Age | 50 |
| Sex | 2 |
| ChestPainType | 4 |
| RestingBP | 69 |
| Cholesterol | 266 |
| FastingBS | 2 |
| RestingECG | 3 |
| MaxHR | 120 |
| Exercise Agina | 2 |
| Oldpeak | 53 |
| ST_Slope | 3 |
| VCF | 5 |
| Smoking | 2 |
| Creatine | 269 |
| Thal | 4 |
| HeartDisease | 2 |
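As an aside, pandas' built-in `nunique` produces the same per-column unique counts in one call; a small sketch on synthetic data:

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['M', 'F', 'M', 'F'],
                    'Age': [40, 49, 37, 40]})
# nunique counts distinct values per column in a single call
print(toy.nunique().to_frame('unique count'))  # Sex → 2, Age → 3
```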
def outlier_detect(df, col):
    q1_col = Q1[col]
    q3_col = Q3[col]
    iqr_col = IQR[col]
    return df[(df[col] < (q1_col - 1.5 * iqr_col)) | (df[col] > (q3_col + 1.5 * iqr_col))]
# ---------------------------------------------------------
def outlier_detect_normal(df, col):
    m = df[col].mean()
    s = df[col].std()
    return df[((df[col] - m) / s).abs() > 3]
# ---------------------------------------------------------
def lower_outlier(df, col):
    q1_col = Q1[col]
    iqr_col = IQR[col]
    lower = df[df[col] < (q1_col - 1.5 * iqr_col)]
    return lower
# ---------------------------------------------------------
def upper_outlier(df, col):
    q3_col = Q3[col]
    iqr_col = IQR[col]
    upper = df[df[col] > (q3_col + 1.5 * iqr_col)]
    return upper
# ---------------------------------------------------------
def replace_upper(df, col):
    q3_col = Q3[col]
    iqr_col = IQR[col]
    upper = q3_col + 1.5 * iqr_col
    # Cap values above the upper bound at the bound itself
    df[col] = df[col].where(lambda x: x < upper, upper)
    print('outlier replace with upper bound - {}'.format(col))
# ---------------------------------------------------------
def replace_lower(df, col):
    q1_col = Q1[col]
    iqr_col = IQR[col]
    lower = q1_col - 1.5 * iqr_col
    # Raise values below the lower bound to the bound itself
    df[col] = df[col].where(lambda x: x > lower, lower)
    print('outlier replace with lower bound - {}'.format(col))
# ---------------------------------------------------------
# Writing Formulas For Upper & Lower Quartiles
Q1 = heart_data.quantile(0.25, numeric_only=True)
Q3 = heart_data.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
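To make the bounds concrete, here is the IQR rule applied to a small synthetic series (the values are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Synthetic values; 250 is an obvious outlier
s = pd.Series([120, 125, 130, 128, 132, 250])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
print(s[(s < lower) | (s > upper)].tolist())  # → [250]
```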
### <b><span style='color:#C40C0C'>6.5.2 </span> | Finding Variables With Outliers Values</b>
for i in range(len(continuous_values)):
    print("IQR => {}: {}".format(continuous_values[i], outlier_detect(heart_data, continuous_values[i]).shape[0]))
    print("Z_Score => {}: {}".format(continuous_values[i], outlier_detect_normal(heart_data, continuous_values[i]).shape[0]))
    print("********************************")
IQR => Age: 0
Z_Score => Age: 0
********************************
IQR => RestingBP: 42
Z_Score => RestingBP: 12
********************************
IQR => Cholesterol: 12
Z_Score => Cholesterol: 8
********************************
IQR => FastingBS: 217
Z_Score => FastingBS: 0
********************************
IQR => MaxHR: 4
Z_Score => MaxHR: 0
********************************
IQR => Oldpeak: 19
Z_Score => Oldpeak: 18
********************************
IQR => VCF: 126
Z_Score => VCF: 27
********************************
IQR => Smoking: 0
Z_Score => Smoking: 0
********************************
IQR => Creatine: 159
Z_Score => Creatine: 16
********************************
IQR => Thal: 0
Z_Score => Thal: 0
********************************
outlier = []
for i in range(len(continuous_values)):
    if outlier_detect(heart_data[continuous_values], continuous_values[i]).shape[0] != 0:
        outlier.append(continuous_values[i])
print("Numerical Variables With Outlier Values : ")
outlier
Numerical Variables With Outlier Values :
['RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak', 'VCF', 'Creatine']
for i in range(len(outlier)):
    replace_upper(heart_data, outlier[i])
print("\n********************************\n")
for i in range(len(outlier)):
    replace_lower(heart_data, outlier[i])
### As you can see, there are no longer any outliers in the numerical features of the dataset.
outlier replace with upper bound - RestingBP
outlier replace with upper bound - Cholesterol
outlier replace with upper bound - FastingBS
outlier replace with upper bound - MaxHR
outlier replace with upper bound - Oldpeak
outlier replace with upper bound - VCF
outlier replace with upper bound - Creatine
********************************
outlier replace with lower bound - RestingBP
outlier replace with lower bound - Cholesterol
outlier replace with lower bound - FastingBS
outlier replace with lower bound - MaxHR
outlier replace with lower bound - Oldpeak
outlier replace with lower bound - VCF
outlier replace with lower bound - Creatine
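The replace functions above cap each tail separately; pandas' `clip` does both in one call. A minimal equivalent sketch on synthetic data (values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({'Cholesterol': [180, 200, 220, 210, 600]})
q1, q3 = toy['Cholesterol'].quantile([0.25, 0.75])
iqr = q3 - q1
# clip caps values below/above the IQR bounds at the bounds themselves
toy['Cholesterol'] = toy['Cholesterol'].clip(lower=q1 - 1.5 * iqr,
                                             upper=q3 + 1.5 * iqr)
print(toy['Cholesterol'].max())  # → 250.0 (600 capped at Q3 + 1.5*IQR)
```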
outlier = []
for i in range(len(continuous_values)):
    if outlier_detect(heart_data[continuous_values], continuous_values[i]).shape[0] != 0:
        outlier.append(continuous_values[i])
print("Numerical Variables With Outlier Values : ")
outlier
Numerical Variables With Outlier Values :
[]
1. Memory Usage 💾
2. Dataset Overview 🗂️
3. Data Quality Check 🛠️
4. Outlier Detection 🚨
5. Summary 🌟

Summary:
- There is a higher number of males (856) compared to females (544).
- ASY is the most common chest pain type, with 760 occurrences.
- For exercise-induced angina, "No" responses are more frequent (815).
- The most common ST slope is Flat (709).
- The target classes are balanced (Yes: 700, No: 700).

Conclusion 📝
# Create subplots with 5 rows and 2 columns
fig, ax = plt.subplots(nrows=5, ncols=2, figsize=(19, 21))
# Define colors for plots
colors = ['#4D3425', '#E4512B', '#5A9BD4', '#FFD700', '#4CAF50', '#F08080', '#808000', '#87CEEB', '#9370DB', '#20B2AA', '#8B4513']
# Columns to visualize
columns_to_visualize = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak', 'VCF', 'Smoking', 'Creatine', 'Thal']
# Plot histograms for each column
for i in range(9):
    plt.subplot(5, 2, i + 1)  # Adjust subplot index
    current_color = colors[i % len(colors)]
    sns.histplot(heart_data[columns_to_visualize[i]], kde=True, bins=20, color=current_color, edgecolor='black')
    plt.title(f'Distribution of {columns_to_visualize[i]}')
    plt.xlabel(columns_to_visualize[i])
# Remove the last subplot
plt.delaxes(ax[4, 1])
plt.tight_layout()
plt.show()
plt.rcParams['axes.facecolor'] = '#f6f5f5'
# Define color palette
color_palette = ["#800000", "#8000ff", "#6aac90", "#5833ff", "#da8829"]
# Create figure and axes
fig, axs = plt.subplots(3, 2, figsize=(17, 17))
# Define categorical columns
categorical_columns = ['Sex', 'ChestPainType', 'RestingECG', 'Exercise Agina', 'ST_Slope']
# Plot count plots for each categorical column
for column, ax in zip(categorical_columns, axs.flatten()):
    sns.countplot(x=column, data=heart_data, palette=color_palette, ax=ax)
    ax.set_xlabel('')
    ax.set_ylabel('Count')
    ax.set_title(column)
# Remove unused subplots
for i in range(len(categorical_columns), axs.size):
    axs.flatten()[i].axis('off')
# Adjust layout
plt.tight_layout()
# Show plot
plt.show()
# Define a template
temp = go.layout.Template(
layout=go.Layout(
title_font=dict(family="Arial", size=21, color="black"),
legend=dict(font=dict(family="Arial", size=12)),
# Add more layout settings as needed
)
)
# Scatter plot for patients with and without heart disease
fig_combined = px.scatter_matrix(heart_data,
dimensions=["Age", "Cholesterol", "RestingBP", "MaxHR", "Oldpeak"],
title='Features Comparison for Patients with and without Heart Disease',
color='HeartDisease', symbol='HeartDisease',
color_discrete_sequence=["#FFDAB9", "#8B0000"],
symbol_sequence=["circle", "circle"],
template=temp)
# Update marker attributes
fig_combined.update_traces(marker=dict(size=15, opacity=.7, line_width=1),
diagonal_visible=False, showupperhalf=False)
# Update layout to increase the size of the plots and add custom legend
fig_combined.update_layout(height=1000, width=1000,
legend=dict(
title="Heart Disease",
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1
))
# Show the combined plot
fig_combined.show()
# Group data by categories
categories = ['ChestPainType', 'Sex', 'RestingECG', 'Exercise Agina', 'ST_Slope']
figs = []
# Create a figure for each category
for category in categories:
    grouped_data = heart_data.groupby(['HeartDisease', category]).size().unstack(fill_value=0)
    colors = ['#BE6B6B', '#FF9999', '#C1D2D1', '#598885', '#E5BAB4']  # Define colors
    fig = go.Figure()
    # Add traces for each category type
    for i, cat_type in enumerate(grouped_data.columns):
        fig.add_trace(go.Bar(x=grouped_data.index, y=grouped_data[cat_type], name=cat_type, marker_color=colors[i]))
    # Update layout
    fig.update_layout(showlegend=True, barmode='group', bargap=0.15, legend_title_text=category, height=400, width=800, plot_bgcolor='rgba(255, 255, 255, 0.7)')
    fig.update_xaxes(title_text="Heart Disease")
    fig.update_yaxes(title_text="Frequency")
    figs.append(fig)
# Show plots
for fig in figs:
    fig.show()
# Define colors and calculate percentages for the pie chart
colors = ["#8B0000", "#FFDAB9", "#8B008B", "#FF8C00"]
counts = heart_data['HeartDisease'].value_counts()
# HeartDisease holds the string labels 'Y'/'N', so index by label rather than position
percentages = [counts['Y'] / counts.sum() * 100, counts['N'] / counts.sum() * 100]
# Create figure and axes
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14, 6)) # Adjusted figsize for better readability
# Plot the pie chart
ax[0].pie(percentages, labels=['Heart Disease', 'No Heart Disease'], autopct='%1.1f%%', startangle=90,
explode=(0.1, 0), colors=colors[:2], wedgeprops={'edgecolor': 'black', 'linewidth': 1, 'antialiased': True})
ax[0].set_title('Heart Disease Distribution (%)')
# Plot the count plot
sns.countplot(x='HeartDisease', data=heart_data, palette=colors[:2], ax=ax[1])
ax[1].set_title('Cases of Heart Disease')
ax[1].set_xlabel('Heart Disease')
ax[1].set_ylabel('Count')
ax[1].set_xticklabels(['No Heart Disease', 'Heart Disease'])
# Adjust layout
fig.tight_layout(pad=3)
plt.show()
# Compute descriptive statistics for individuals with and without heart disease
heart_disease = heart_data[heart_data['HeartDisease'] == 'Y'].describe().T
no_heart_disease = heart_data[heart_data['HeartDisease'] == 'N'].describe().T
# Create a figure with two subplots side-by-side
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 8)) # Increased figsize for better readability
# Plot heatmap for individuals with heart disease
sns.heatmap(heart_disease[['mean']], annot=True, cmap='YlOrRd', linewidths=0.4, linecolor='black', cbar=False, fmt='.2f', ax=ax[0])
ax[0].set_title('Heart Disease')
# Plot heatmap for individuals without heart disease
sns.heatmap(no_heart_disease[['mean']], annot=True, cmap='YlGnBu', linewidths=0.4, linecolor='black', cbar=False, fmt='.2f', ax=ax[1])
ax[1].set_title('No Heart Disease')
# Adjust layout for better spacing
fig.tight_layout(pad=3)
plt.show()
df_corr= heart_data[continuous_values].corr()
df_corr
| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | VCF | Smoking | Creatine | Thal |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | 0.187222 | -0.071334 | 0.021653 | -0.251494 | 0.142371 | 0.017135 | 0.004155 | 0.112963 | -0.038278 |
| RestingBP | 0.187222 | 1.000000 | 0.138005 | 0.053381 | -0.074368 | 0.096835 | -0.004460 | 0.016034 | 0.032899 | 0.019768 |
| Cholesterol | -0.071334 | 0.138005 | 1.000000 | -0.010789 | 0.155643 | 0.060347 | -0.036740 | -0.065205 | 0.089271 | 0.315564 |
| FastingBS | 0.021653 | 0.053381 | -0.010789 | 1.000000 | -0.027226 | -0.032672 | 0.107413 | 0.013005 | 0.011581 | 0.011035 |
| MaxHR | -0.251494 | -0.074368 | 0.155643 | -0.027226 | 1.000000 | -0.085468 | 0.000727 | -0.059284 | -0.005993 | 0.043660 |
| Oldpeak | 0.142371 | 0.096835 | 0.060347 | -0.032672 | -0.085468 | 1.000000 | 0.011560 | 0.024903 | 0.044719 | 0.068832 |
| VCF | 0.017135 | -0.004460 | -0.036740 | 0.107413 | 0.000727 | 0.011560 | 1.000000 | 0.028354 | 0.058569 | 0.022094 |
| Smoking | 0.004155 | 0.016034 | -0.065205 | 0.013005 | -0.059284 | 0.024903 | 0.028354 | 1.000000 | -0.020810 | -0.012151 |
| Creatine | 0.112963 | 0.032899 | 0.089271 | 0.011581 | -0.005993 | 0.044719 | 0.058569 | -0.020810 | 1.000000 | 0.125868 |
| Thal | -0.038278 | 0.019768 | 0.315564 | 0.011035 | 0.043660 | 0.068832 | 0.022094 | -0.012151 | 0.125868 | 1.000000 |
plt.figure(figsize=(19,7))
sns.heatmap(df_corr, annot = True, cmap = 'YlGnBu')
plt.title('Correlation Matrix of Continuous Variables')
plt.show()
Correlation Insights

Heart Disease correlations:
- 💓 Exercise Agina (0.81): strongly positively correlated. Individuals experiencing angina during exercise are more likely to have heart disease.
- 💔 ChestPainType (-0.17): moderately negatively correlated. Certain types of chest pain may be associated with a lower risk of heart disease.
- ❤️ MaxHR (-0.22): moderately negatively correlated. Individuals with lower maximum heart rates during exercise are more likely to have heart disease.
- ST_Slope (-0.22): moderately negatively correlated. Certain patterns in the ST segment during exercise may indicate a lower risk of heart disease.
- RestingECG (0.05): weakly positively correlated. Abnormalities in resting electrocardiographic results may slightly increase the likelihood of heart disease.

Other variable correlations:
- Cholesterol and Thal (0.32): moderately positively correlated. Certain types of thalassemia may influence cholesterol levels.

categorical_values
['Sex', 'ChestPainType', 'RestingECG', 'Exercise Agina', 'ST_Slope', 'HeartDisease']
plt.figure(figsize=(20, 15))
num_cols = len(heart_data.columns)
num_rows = (num_cols // 4) + (num_cols % 4 > 0)
for i, col in enumerate(heart_data.columns, 1):
    plt.subplot(num_rows, 4, i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(heart_data[col], kde=True, color='green', alpha=0.5)
plt.tight_layout()
plt.show()
# Create a LabelEncoder object
le = LabelEncoder()
df1 = heart_data.copy(deep=True)
# Apply Label Encoding using a loop
for col in categorical_values:
    df1[col] = le.fit_transform(df1[col]).astype('int64')
df1.head()
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | Exercise Agina | Oldpeak | ST_Slope | VCF | Smoking | Creatine | Thal | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 1 | 1 | 140 | 289 | 0 | 1 | 172 | 0 | 0.0 | 2 | 2 | 0 | 168 | 3 | 1 |
| 1 | 49 | 0 | 2 | 160 | 180 | 1 | 1 | 156 | 0 | 1.0 | 1 | 0 | 0 | 155 | 3 | 1 |
| 2 | 37 | 1 | 1 | 130 | 283 | 0 | 2 | 98 | 0 | 0.0 | 2 | 0 | 1 | 125 | 3 | 1 |
| 3 | 48 | 0 | 0 | 138 | 214 | 0 | 1 | 108 | 1 | 1.5 | 1 | 1 | 0 | 161 | 3 | 1 |
| 4 | 54 | 1 | 2 | 150 | 195 | 1 | 1 | 122 | 0 | 0.0 | 2 | 3 | 0 | 106 | 2 | 1 |
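One caveat with the encoding loop above: a single LabelEncoder instance is refit on each column, so only the last column's mapping survives in `le.classes_`. A small sketch (on synthetic data) that keeps one fitted encoder per column, so each mapping is available later for `inverse_transform`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'Sex': ['M', 'F', 'M'], 'HeartDisease': ['Y', 'N', 'Y']})
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col]).astype('int64')
    encoders[col] = enc  # keep each fitted encoder for this column

# LabelEncoder assigns codes in sorted (alphabetical) order
print(list(encoders['Sex'].classes_))  # → ['F', 'M'] (F→0, M→1)
```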
df = df1[categorical_values].corr()
df
| | Sex | ChestPainType | RestingECG | Exercise Agina | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|
| Sex | 1.000000 | 0.005235 | 0.042704 | 0.078186 | -0.014215 | 0.073271 |
| ChestPainType | 0.005235 | 1.000000 | -0.022056 | -0.220928 | 0.133794 | -0.165447 |
| RestingECG | 0.042704 | -0.022056 | 1.000000 | 0.087585 | -0.035039 | 0.054045 |
| Exercise Agina | 0.078186 | -0.220928 | 0.087585 | 1.000000 | -0.267247 | 0.812468 |
| ST_Slope | -0.014215 | 0.133794 | -0.035039 | -0.267247 | 1.000000 | -0.218887 |
| HeartDisease | 0.073271 | -0.165447 | 0.054045 | 0.812468 | -0.218887 | 1.000000 |
# Calculate correlations excluding the 'HeartDisease' column
corr = df1.drop('HeartDisease', axis=1).corrwith(df1['HeartDisease']).sort_values(ascending=False).to_frame()
corr.columns = ['Correlations']
# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr, annot=True, annot_kws={"size": 11.5}, fmt='.2f', cmap='RdBu_r', center=0, linewidths=0.1, alpha=0.9)
plt.title('Correlation with Heart Disease (excluding HeartDisease)')
plt.xticks(rotation=0)
plt.yticks(rotation=0)
plt.show()
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1400 entries, 0 to 1399
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             1400 non-null   int64
 1   Sex             1400 non-null   int64
 2   ChestPainType   1400 non-null   int64
 3   RestingBP       1400 non-null   int64
 4   Cholesterol     1400 non-null   int64
 5   FastingBS       1400 non-null   int64
 6   RestingECG      1400 non-null   int64
 7   MaxHR           1400 non-null   int64
 8   Exercise Agina  1400 non-null   int64
 9   Oldpeak         1400 non-null   float64
 10  ST_Slope        1400 non-null   int64
 11  VCF             1400 non-null   int64
 12  Smoking         1400 non-null   int64
 13  Creatine        1400 non-null   int64
 14  Thal            1400 non-null   int64
 15  HeartDisease    1400 non-null   int64
dtypes: float64(1), int64(15)
memory usage: 185.9 KB
As you can see, all the object-type columns have been encoded into integer type.
plt.figure(figsize=(17, 15))
num_cols = len(df1.columns)
num_rows = (num_cols // 4) + (num_cols % 4 > 0)
for i, col in enumerate(df1.columns, 1):
    plt.subplot(num_rows, 4, i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(df1[col], kde=True, color='Darkred', alpha=0.5)
plt.tight_layout()
plt.show()
encoded_categorical_columns = df1[['Sex', 'ChestPainType', 'RestingECG', 'Exercise Agina', 'ST_Slope', 'HeartDisease']]
excel_path = r"C:\Users\acer\Downloads\IDS Project\Categorical_Encoding.xlsx"
encoded_categorical_columns.to_excel(excel_path, index=False)
print("Excel File Created successfully!")
Excel File Created successfully!

# Create scaler objects
mms = MinMaxScaler() # For Min-Max scaling (Normalization)
df2 = df1.copy(deep=True)
# Apply scaling using a loop
for col in continuous_values:
    df2[col] = mms.fit_transform(df2[[col]])
df2.head()
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | Exercise Agina | Oldpeak | ST_Slope | VCF | Smoking | Creatine | Thal | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.244898 | 1 | 1 | 0.70 | 0.479270 | 0.0 | 1 | 0.788732 | 0 | 0.295455 | 2 | 0.50 | 0.0 | 0.052328 | 1.000000 | 1 |
| 1 | 0.428571 | 0 | 2 | 0.80 | 0.298507 | 1.0 | 1 | 0.676056 | 0 | 0.409091 | 1 | 0.00 | 0.0 | 0.047636 | 1.000000 | 1 |
| 2 | 0.183673 | 1 | 1 | 0.65 | 0.469320 | 0.0 | 2 | 0.267606 | 0 | 0.295455 | 2 | 0.00 | 1.0 | 0.036810 | 1.000000 | 1 |
| 3 | 0.408163 | 0 | 0 | 0.69 | 0.354892 | 0.0 | 1 | 0.338028 | 1 | 0.465909 | 1 | 0.25 | 0.0 | 0.049802 | 1.000000 | 1 |
| 4 | 0.530612 | 1 | 2 | 0.75 | 0.323383 | 1.0 | 1 | 0.436620 | 0 | 0.295455 | 2 | 0.75 | 0.0 | 0.029953 | 0.666667 | 1 |
# Define the path to save the CSV file
csv_path = r"C:\Users\acer\Downloads\IDS Project\Normalization.csv"
# Save the DataFrame with scaled columns to a CSV file
df2.to_csv(csv_path, index=False)
print("CSV File Created successfully!")
CSV File Created successfully!
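One caution about the scaling above: the MinMaxScaler is fit on the full dataset before the train/test split, which leaks the test set's min/max into training. A minimal sketch of the leak-free pattern, fitting on the training portion only (synthetic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0], [60.0]])
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

mms = MinMaxScaler().fit(X_train)        # fit on training data only
X_train_s = mms.transform(X_train)       # guaranteed in [0, 1]
X_test_s = mms.transform(X_test)         # may fall slightly outside [0, 1]
print(X_train_s.min(), X_train_s.max())  # → 0.0 1.0
```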
# Use all columns in categorical_values, excluding the target column
features = df2[categorical_values].drop(columns=['HeartDisease'])
X = features.iloc[:, :]
y = heart_data['HeartDisease']
# Applying SelectKBest with chi-squared test
best_features = SelectKBest(score_func=chi2, k='all')
fit = best_features.fit(X, y)
# Creating a DataFrame to store the chi-squared scores
featureScores = pd.DataFrame(data={'Feature': X.columns, 'Chi Squared Score': fit.scores_})
# Sort the features by their chi-squared scores in descending order
featureScores = featureScores.sort_values(by='Chi Squared Score', ascending=False)
# Print selected features and their scores
print("Selected Features and Chi Squared Scores:")
print(featureScores)
# Plotting the chi-squared scores
plt.subplots(figsize=(5, 5))
sns.heatmap(featureScores.set_index('Feature'), annot=True, linewidths=0.4, linecolor='black', fmt='.2f')
plt.title('Selection of Categorical Features (Excluding HeartDisease)')
plt.show()
Selected Features and Chi Squared Scores:
Feature Chi Squared Score
3 Exercise Agina 537.984615
1 ChestPainType 44.181818
4 ST_Slope 19.091929
0 Sex 2.920561
2 RestingECG 1.937984
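Note that the chi-squared test requires non-negative feature values, which the label-encoded categoricals satisfy. A tiny self-contained sketch of SelectKBest with chi2 on synthetic category codes:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic non-negative category codes; chi2 requires X >= 0
X = np.array([[0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 1]])
y = np.array([1, 0, 1, 0, 1, 1])
skb = SelectKBest(score_func=chi2, k=1).fit(X, y)
print(skb.scores_)  # one chi-squared score per feature
```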
# Separating features and target variable
X_continuous = df2[continuous_values]
y_continuous = df2['HeartDisease']
# Applying ANOVA
f_values, p_values = f_classif(X_continuous, y_continuous)
# Creating a DataFrame to store the results
anova_results = pd.DataFrame(data={'F-value': f_values, 'p-value': p_values}, index=X_continuous.columns)
# Displaying the results
print("ANOVA Results:")
print(anova_results)
# Plotting the results
plt.figure(figsize=(8, 5))
sns.barplot(x=anova_results['F-value'], y=anova_results.index, palette='coolwarm')
plt.title('ANOVA F-values')
plt.xlabel('F-value')
plt.ylabel('Features')
plt.show()
ANOVA Results:
F-value p-value
Age 24.123926 1.009666e-06
RestingBP 4.136714 4.215104e-02
Cholesterol 0.140297 7.080425e-01
FastingBS 1.968707 1.608072e-01
MaxHR 30.831145 3.364223e-08
Oldpeak 109.160511 1.178386e-24
VCF 0.188899 6.639020e-01
Smoking 0.026966 8.695862e-01
Creatine 6.124673 1.344867e-02
Thal 0.017430 8.949856e-01
Chi-Squared Scores — Insight: Exercise Agina, ChestPainType, and ST_Slope have the highest chi-squared scores, indicating a strong association with the target, while Sex and RestingECG score much lower.
ANOVA Results — Insight: Oldpeak, MaxHR, and Age have high F-values with very low p-values, indicating a strong influence on the target; the remaining continuous features show lower F-values and higher p-values, suggesting weaker influences.
Summarized Insights: Age, ChestPainType, MaxHR, Exercise Agina, Oldpeak, and ST_Slope are key predictors of the target variable.
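The two tests above can be exercised end-to-end on synthetic data. The sketch below is illustrative only — the column names and data are made up, not the project's `df2` — and shows chi-squared scoring for categorical features alongside ANOVA F-values for continuous ones:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, f_classif

rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=200)

# Categorical stand-ins: one feature that mostly tracks the target, one pure noise
cat_demo = pd.DataFrame({
    'cat_informative': y_demo ^ (rng.random(200) < 0.1),
    'cat_noise': rng.integers(0, 3, size=200),
})
# Continuous stand-ins, same idea
cont_demo = pd.DataFrame({
    'cont_informative': y_demo + rng.normal(0, 0.5, size=200),
    'cont_noise': rng.normal(0, 1, size=200),
})

# Chi-squared requires non-negative (e.g. label-encoded) categorical features
chi_scores = SelectKBest(score_func=chi2, k='all').fit(cat_demo, y_demo).scores_
# ANOVA F-test for the continuous features
f_scores, p_vals = f_classif(cont_demo, y_demo)

print(dict(zip(cat_demo.columns, chi_scores.round(2))))
print(dict(zip(cont_demo.columns, f_scores.round(2))))
```

As in the project's results, the informative stand-ins dominate the scores while the noise columns stay near zero.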
selected_features = ['Age', 'ChestPainType', 'MaxHR', 'Exercise Agina', 'Oldpeak', 'ST_Slope']
# Extract the hand-picked features from the DataFrame
features = df2[selected_features].values
target = df2['HeartDisease'].values
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=42)
# Re-split using ALL features; SelectKBest below will reduce them to the best six
X = df2.drop(columns='HeartDisease', axis=1).values
Y = df2['HeartDisease'].values
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
# Initialize and fit SelectKBest
Kbest_classif = SelectKBest(score_func=f_classif, k=6)
Kbest_classif.fit(x_train, y_train)
# Print the scores for the features
for i in range(len(Kbest_classif.scores_)):
    print(f'Feature {i} : {round(Kbest_classif.scores_[i], 3)}')
# Plot the feature scores
plt.bar(df2.drop(columns='HeartDisease').columns, Kbest_classif.scores_)
plt.xticks(rotation=90)
plt.rcParams["figure.figsize"] = (8, 6)
plt.show()
Feature 0 : 18.626
Feature 1 : 6.584
Feature 2 : 29.905
Feature 3 : 0.959
Feature 4 : 0.192
Feature 5 : 3.668
Feature 6 : 6.748
Feature 7 : 22.826
Feature 8 : 1977.81
Feature 9 : 80.13
Feature 10 : 41.493
Feature 11 : 0.682
Feature 12 : 0.119
Feature 13 : 4.775
Feature 14 : 0.155
# transform training set
x_train_classif = Kbest_classif.transform(x_train)
print("X_train.shape: {}".format(x_train.shape))
print()
print("X_train_selected.shape: {}".format(x_train_classif.shape))
print()
# transform test data
x_test_classif = Kbest_classif.transform(x_test)
X_train.shape: (1120, 15)

X_train_selected.shape: (1120, 6)
# Get the selected feature indices
selected_feature_indices = Kbest_classif.get_support(indices=True)
# Get the column names of the selected features
selected_feature_names = df2.drop(columns='HeartDisease').columns[selected_feature_indices]
# Display the column names
print("Selected feature names:")
print(selected_feature_names)
Selected feature names:
Index(['Age', 'ChestPainType', 'MaxHR', 'Exercise Agina', 'Oldpeak',
'ST_Slope'],
dtype='object')
print("Training set features shape:", x_train_classif.shape)
print("Testing set features shape:", x_test_classif.shape)
print("Training set target shape:", y_train.shape)
print("Testing set target shape:", y_test.shape)
Training set features shape: (1120, 6)
Testing set features shape: (280, 6)
Training set target shape: (1120,)
Testing set target shape: (280,)
# Extract the selected features from the DataFrame
features = df2[selected_features]
target = df2['HeartDisease']
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=42)
# Save the training features
import os
x_train_path = r'C:\Users\acer\Downloads\IDS Project\x_train.csv'
if os.path.exists(x_train_path):
    os.remove(x_train_path)
x_train_df = pd.DataFrame(data=x_train_classif)
x_train_df.to_csv(x_train_path, index=False)
print("Training features saved successfully!")
# Save the testing features
x_test_path = r'C:\Users\acer\Downloads\IDS Project\x_test.csv'
if os.path.exists(x_test_path):
    os.remove(x_test_path)
x_test_df = pd.DataFrame(data=x_test_classif)
x_test_df.to_csv(x_test_path, index=False)
print("Testing features saved successfully!")
# Save the training target
y_train_path = r'C:\Users\acer\Downloads\IDS Project\y_train.csv'
if os.path.exists(y_train_path):
    os.remove(y_train_path)
y_train_df = pd.DataFrame(data=y_train, columns=['HeartDisease'])
y_train_df.to_csv(y_train_path, index=False)
print("Training target saved successfully!")
# Save the testing target
y_test_path = r'C:\Users\acer\Downloads\IDS Project\y_test.csv'
if os.path.exists(y_test_path):
    os.remove(y_test_path)
y_test_df = pd.DataFrame(data=y_test, columns=['HeartDisease'])
y_test_df.to_csv(y_test_path, index=False)
print("Testing target saved successfully!")
Training features saved successfully!
Testing features saved successfully!
Training target saved successfully!
Testing target saved successfully!

# Metrics and cross-validation utilities used by the helpers below
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score,
                             precision_score, recall_score, f1_score, roc_auc_score)
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, StratifiedKFold, KFold

def model_evaluation(classifier, x_test, y_test):
    # Confusion Matrix
    cm = confusion_matrix(y_test, classifier.predict(x_test))
    names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    # Pair each cell count with its label (flatten the 2x2 matrix first)
    labels = np.array(['{}\n{}'.format(name, value)
                       for name, value in zip(names, cm.flatten())]).reshape(2, 2)
    sns.heatmap(cm, annot=labels, fmt='', annot_kws={"size": 14})
    plt.title('Confusion Matrix')
    plt.show()
    # Classification Report
    print("\nClassification Report:\n", classification_report(y_test, classifier.predict(x_test)))

def model(classifier, x_train, y_train, x_test, y_test):
    classifier.fit(x_train, y_train)
    prediction = classifier.predict(x_test)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # Calculate metrics
    train_accuracy = accuracy_score(y_train, classifier.predict(x_train))
    test_accuracy = accuracy_score(y_test, prediction)
    precision = precision_score(y_test, prediction)
    recall = recall_score(y_test, prediction)
    f1 = f1_score(y_test, prediction)
    cross_val_score_mean = cross_val_score(classifier, x_train, y_train, cv=cv, scoring='roc_auc').mean()
    roc_auc = roc_auc_score(y_test, prediction)
    print("Training Accuracy: {:.2%}".format(train_accuracy))
    print("Testing Accuracy: {:.2%}".format(test_accuracy))
    print("Precision: {:.2%}".format(precision))
    print("Recall: {:.2%}".format(recall))
    print("F1 Score: {:.2%}".format(f1))
    print("Cross Validation Score: {:.2%}".format(cross_val_score_mean))
    print("ROC_AUC Score: {:.2%}".format(roc_auc))
    # Evaluation
    model_evaluation(classifier, x_test, y_test)

def kfold_cross_validation(classifier, x_train, y_train, cv, scoring=accuracy_score):
    # Use stratified k-fold when the classifier supports probability prediction
    if hasattr(classifier, 'predict_proba'):
        kfold = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    else:
        kfold = KFold(n_splits=cv, shuffle=True, random_state=42)
    scores = []
    for train_index, test_index in kfold.split(x_train, y_train):
        x_train_fold, x_test_fold = x_train[train_index], x_train[test_index]
        y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]
        classifier.fit(x_train_fold, y_train_fold)
        y_pred = classifier.predict(x_test_fold)
        score = scoring(y_test_fold, y_pred)  # Use the provided scoring function
        scores.append(score)
    return scores
I performed hyperparameter tuning for four of the models:
Random Forest: we explored hyperparameters such as n_estimators, max_depth, and min_samples_split using Randomized Search to optimize the model's performance.
MLP Classifier: we tuned hyperparameters such as hidden_layer_sizes, the activation function, alpha, and the learning rate to improve performance.
HistGradientBoosting: we explored hyperparameters such as learning_rate, max_iter, max_leaf_nodes, and min_samples_leaf.
KNN: we explored hyperparameters such as n_neighbors, weights, and the distance metric using Grid Search.
In total, this project implements five machine learning algorithms: the four tuned above plus a Decision Tree.
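Before the per-model grids below, here is a minimal, self-contained sketch of the randomized-search pattern used throughout this section. The data and the small grid are illustrative, not the project's actual tuning run:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative data and grid only -- the real grids are defined below
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
demo_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 8, None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=demo_grid,
    n_iter=5,              # sample 5 of the 9 possible combinations
    cv=3,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Unlike GridSearchCV, which evaluates every combination, RandomizedSearchCV samples `n_iter` candidates at random — a useful trade-off for the large Random Forest grid below.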
# Expanded parameter grid
param_grid = {
'criterion': ['gini', 'entropy'],
'max_depth': [4, 8, 12, 16, 20],
'n_estimators': [50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
'min_samples_split': [2, 5, 7, 11, 13, 15],
'min_samples_leaf': [1, 2, 4, 6, 8, 10],
'max_features': ['sqrt', 'log2']
}
# Create a RandomForestClassifier
classifier_rf = RandomForestClassifier(random_state= 42)
# Use RandomizedSearchCV for parameter tuning
random_search = RandomizedSearchCV(classifier_rf, param_distributions=param_grid, n_iter= 50, cv=5, scoring='accuracy', random_state= 42, n_jobs=-1)
random_search.fit(x_train_classif, y_train)
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [4, 8, 12, 16, 20],
                                        'max_features': ['sqrt', 'log2'],
                                        'min_samples_leaf': [1, 2, 4, 6, 8, 10],
                                        'min_samples_split': [2, 5, 7, 11, 13, 15],
                                        'n_estimators': [50, 100, 200, 300, 400, 500,
                                                         600, 700, 800, 900, 1000]},
                   random_state=42, scoring='accuracy')
# Print the best parameters
print("Best Parameters:", random_search.best_params_)
Best Parameters: {'n_estimators': 1000, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 20, 'criterion': 'entropy'}
# Get the best model
best_classifier = random_search.best_estimator_
model(best_classifier, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 89.64%
Testing Accuracy: 92.86%
Precision: 98.53%
Recall: 88.16%
F1 Score: 93.06%
Cross Validation Score: 89.87%
ROC_AUC Score: 93.30%
Classification Report:
precision recall f1-score support
0 0.88 0.98 0.93 128
1 0.99 0.88 0.93 152
accuracy 0.93 280
macro avg 0.93 0.93 0.93 280
weighted avg 0.93 0.93 0.93 280
k_fold_scores = kfold_cross_validation(best_classifier, x_train_classif, y_train, cv=5, scoring=accuracy_score)
print("Mean Accuracy: {:.2f} %".format(np.mean(k_fold_scores)*100))
print("Std. Dev: {:.2f} %".format(np.std(k_fold_scores)*100))
Mean Accuracy: 89.38 %
Std. Dev: 1.37 %
A lower standard deviation means the model's performance is consistent across folds.
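That consistency check can be reproduced for any classifier: compute the per-fold scores and report their mean and spread. A minimal sketch on synthetic data (illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data; the point is the mean/std reporting pattern, not the numbers
X_demo, y_demo = make_classification(n_samples=400, random_state=0)
fold_scores = cross_val_score(RandomForestClassifier(random_state=0), X_demo, y_demo, cv=5)

print("Mean Accuracy: {:.2f} %".format(fold_scores.mean() * 100))
print("Std. Dev: {:.2f} %".format(fold_scores.std() * 100))
```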
# Use the best model for predictions on the test set with selected features
y_pred_rf = best_classifier.predict(x_test_classif)
# Create a new DataFrame with actual labels and RandomForestClassifier predictions
result_df_rf = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_rf})
# Display the result dataframe
result_df_rf.head(10)
|   | Actual | Predicted |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 1 | 1 |
# Define the file path for saving the CSV file
file_path = r"C:\Users\acer\Downloads\IDS Project\RandomForestPredictions.csv"
# Save the result dataframe to a CSV file
result_df_rf.to_csv(file_path, index=False)
print(f"Results saved to {file_path}")
Results saved to C:\Users\acer\Downloads\IDS Project\RandomForestPredictions.csv
def predict_heart_disease(model, input_data):
    # Convert input data to a list
    input_data_as_list = list(input_data)
    # Wrap the list in another list, as we are predicting for only one instance
    input_data_reshaped = [input_data_as_list]
    # Make prediction using the model
    prediction = model.predict(input_data_reshaped)
    # Return the prediction
    return prediction[0]

# Function to take input from the user
def get_user_input():
    age = int(input("Enter Age: "))
    chest_pain_type = int(input("Enter Chest Pain Type (0 for Typical Angina, 1 for Atypical Angina, 2 for Non-Anginal Pain, 3 for Asymptomatic): "))
    resting_ecg = int(input("Enter Resting ECG (0 for Normal, 1 for ST-T Wave Abnormality, 2 for Left Ventricular Hypertrophy): "))
    max_hr = int(input("Enter Max Heart Rate: "))
    exercise_angina = int(input("Enter Exercise-Induced Angina (0 for No, 1 for Yes): "))
    oldpeak = float(input("Enter Oldpeak: "))
    st_slope = int(input("Enter ST Slope (0 for Upsloping, 1 for Flat, 2 for Downsloping): "))
    # Convert all input data to a list
    input_data_as_list = [age, chest_pain_type, resting_ecg, max_hr, exercise_angina, oldpeak, st_slope]
    return input_data_as_list
input_data2 = get_user_input()
result2 = predict_heart_disease(best_classifier, input_data2)
# Print results
print("\nIndividual 2:", "Heart Disease" if result2 == 1 else "No Heart Disease")
Enter Age: 46 Enter Chest Pain Type (0 for Typical Angina, 1 for Atypical Angina, 2 for Non-Anginal Pain, 3 for Asymptomatic): 0 Enter Resting ECG (0 for Normal, 1 for ST-T Wave Abnormality, 2 for Left Ventricular hypertrophy): 1 Enter Max Heart Rate: 112 Enter Exercise-Induced Angina (0 for No, 1 for Yes): 0 Enter Oldpeak: 0 Enter ST Slope (0 for Upsloping, 1 for Flat, 2 for Downsloping): 2 Individual 2: No Heart Disease

# Define the MLPClassifier
mlp = MLPClassifier(random_state=42, max_iter=1000)
# Define the parameter distributions to search
param_dist = {
'hidden_layer_sizes': [(100,), (50, 50), (50, 100, 50)],
'activation': ['relu', 'tanh', 'logistic'],
'alpha': [0.0001, 0.001, 0.01, 0.1],
'learning_rate': ['constant', 'adaptive'],
}
# Initialize RandomizedSearchCV
random_search_mlp = RandomizedSearchCV(estimator=mlp, param_distributions=param_dist, n_iter=50, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
# Fit RandomizedSearchCV
random_search_mlp.fit(x_train_classif, y_train)
RandomizedSearchCV(cv=5,
                   estimator=MLPClassifier(max_iter=1000, random_state=42),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'activation': ['relu', 'tanh', 'logistic'],
                                        'alpha': [0.0001, 0.001, 0.01, 0.1],
                                        'hidden_layer_sizes': [(100,), (50, 50), (50, 100, 50)],
                                        'learning_rate': ['constant', 'adaptive']},
                   random_state=42, scoring='accuracy')
# Print the best parameters
print(f'Best parameters for MLPClassifier: {random_search_mlp.best_params_}')
# Print the best score
print(f'Best cross-validation accuracy for MLPClassifier: {random_search_mlp.best_score_}')
Best parameters for MLPClassifier: {'learning_rate': 'constant', 'hidden_layer_sizes': (50, 50), 'alpha': 0.01, 'activation': 'logistic'}
Best cross-validation accuracy for MLPClassifier: 0.89375
# Use the best estimator found by the search
best_mlp = random_search_mlp.best_estimator_
# Equivalently, rebuild it explicitly with the best parameters (random_state added for reproducibility)
best_mlp = MLPClassifier(activation='logistic', alpha=0.01, learning_rate='constant', hidden_layer_sizes=(50, 50), max_iter=1000, random_state=42)
model(best_mlp, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 89.38%
Testing Accuracy: 92.86%
Precision: 98.53%
Recall: 88.16%
F1 Score: 93.06%
Cross Validation Score: 88.78%
ROC_AUC Score: 93.30%
Classification Report:
precision recall f1-score support
0 0.88 0.98 0.93 128
1 0.99 0.88 0.93 152
accuracy 0.93 280
macro avg 0.93 0.93 0.93 280
weighted avg 0.93 0.93 0.93 280
# Use the best model for predictions on the test set with selected features
y_pred_mlp = best_mlp.predict(x_test_classif)
# Create a new DataFrame with selected features and predictions for RandomForestClassifier
result_df_mlp = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_mlp})
# Display the result dataframe
result_df_mlp.head(10)
|   | Actual | Predicted |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 1 | 1 |
# Define the file path for saving the CSV file
file_path = r"C:\Users\acer\Downloads\IDS Project\MLP_Predictions.csv"
# Save the result dataframe to a CSV file
result_df_mlp.to_csv(file_path, index=False)
print(f"Results saved to {file_path}")
Results saved to C:\Users\acer\Downloads\IDS Project\MLP_Predictions.csv

classifier_dt = DecisionTreeClassifier(random_state = 42, max_depth = 20, min_samples_leaf = 4, min_samples_split = 2)
model(classifier_dt, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 92.50%
Testing Accuracy: 87.86%
Precision: 90.41%
Recall: 86.84%
F1 Score: 88.59%
Cross Validation Score: 87.95%
ROC_AUC Score: 87.95%
Classification Report:
precision recall f1-score support
0 0.85 0.89 0.87 128
1 0.90 0.87 0.89 152
accuracy 0.88 280
macro avg 0.88 0.88 0.88 280
weighted avg 0.88 0.88 0.88 280
# Use the best model for predictions on the test set with selected features
y_pred_dt = classifier_dt.predict(x_test_classif)
# Create a new DataFrame with selected features and predictions for RandomForestClassifier
result_dt = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_dt })
# Display the result dataframe
result_dt.head()
|   | Actual | Predicted |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |
# Define the parameter distribution for HistGradientBoostingClassifier
param_dist_hgb = {
'learning_rate': [0.01, 0.1, 0.2, 0.3],
'max_iter': [100, 200, 400, 800, 1000],
'max_leaf_nodes': [5, 10, 20, 30],
'min_samples_leaf': [1, 5, 10, 15, 20],
}
# Initialize the HistGradientBoostingClassifier
hgb = HistGradientBoostingClassifier(random_state=42)
# Initialize RandomizedSearchCV
random_search_hgb = RandomizedSearchCV(estimator=hgb, param_distributions=param_dist_hgb, n_iter=50, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
# Fit RandomizedSearchCV
random_search_hgb.fit(x_train_classif, y_train)
RandomizedSearchCV(cv=5,
                   estimator=HistGradientBoostingClassifier(random_state=42),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'learning_rate': [0.01, 0.1, 0.2, 0.3],
                                        'max_iter': [100, 200, 400, 800, 1000],
                                        'max_leaf_nodes': [5, 10, 20, 30],
                                        'min_samples_leaf': [1, 5, 10, 15, 20]},
                   random_state=42, scoring='accuracy')
# Print the best parameters
print(f'Best parameters for HistGradientBoosting: {random_search_hgb.best_params_}')
best_hgb = random_search_hgb.best_estimator_
# Print the best score
print(f'Best cross-validation accuracy for HistGradientBoosting: {random_search_hgb.best_score_}')
Best parameters for HistGradientBoosting: {'min_samples_leaf': 20, 'max_leaf_nodes': 5, 'max_iter': 1000, 'learning_rate': 0.01}
Best cross-validation accuracy for HistGradientBoosting: 0.89375
hist = HistGradientBoostingClassifier(learning_rate = 0.01, max_iter = 1000, max_leaf_nodes = 5, min_samples_leaf = 20, random_state = 42)
model(hist, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 89.82%
Testing Accuracy: 91.79%
Precision: 96.40%
Recall: 88.16%
F1 Score: 92.10%
Cross Validation Score: 89.75%
ROC_AUC Score: 92.13%
Classification Report:
precision recall f1-score support
0 0.87 0.96 0.91 128
1 0.96 0.88 0.92 152
accuracy 0.92 280
macro avg 0.92 0.92 0.92 280
weighted avg 0.92 0.92 0.92 280
# Use the best model for predictions on the test set with selected features
y_pred_h = hist.predict(x_test_classif)
# Create a new DataFrame with selected features and predictions for RandomForestClassifier
result_h = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_h })
# Display the result dataframe
result_h.head()
|   | Actual | Predicted |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
# Define the expanded parameter grid
param_grid_knn = {
'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan', 'minkowski']
}
# Initialize the KNeighborsClassifier
knn = KNeighborsClassifier()
# Initialize GridSearchCV
grid_search_knn = GridSearchCV(estimator=knn, param_grid=param_grid_knn, cv=5, scoring='accuracy', n_jobs=-1)
# Fit GridSearchCV
grid_search_knn.fit(x_train_classif, y_train)
GridSearchCV(cv=5, estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'metric': ['euclidean', 'manhattan', 'minkowski'],
                         'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
                         'weights': ['uniform', 'distance']},
             scoring='accuracy')
# Print the best parameters
print("Best parameters found:")
print(grid_search_knn.best_params_)
best_params_knn = grid_search_knn.best_params_
# Print the best cross-validation score
print("Best cross-validation accuracy:")
print(grid_search_knn.best_score_)
Best parameters found:
{'metric': 'manhattan', 'n_neighbors': 11, 'weights': 'uniform'}
Best cross-validation accuracy:
0.8928571428571429
# Initialize the KNeighborsClassifier with the best parameters found by the grid search
best_knn_model = KNeighborsClassifier(**best_params_knn)
# Equivalent explicit form: KNeighborsClassifier(metric='manhattan', n_neighbors=11, weights='uniform')
model(best_knn_model, x_train_classif, y_train, x_test_classif, y_test)
Training Accuracy: 89.46%
Testing Accuracy: 92.14%
Precision: 97.10%
Recall: 88.16%
F1 Score: 92.41%
Cross Validation Score: 89.77%
ROC_AUC Score: 92.52%
Classification Report:
precision recall f1-score support
0 0.87 0.97 0.92 128
1 0.97 0.88 0.92 152
accuracy 0.92 280
macro avg 0.92 0.93 0.92 280
weighted avg 0.93 0.92 0.92 280
# Use the best model for predictions on the test set with selected features
y_pred_knn = best_knn_model.predict(x_test_classif)
# Create a new DataFrame with selected features and predictions for RandomForestClassifier
result_knn = pd.DataFrame({ 'Actual': y_test, 'Predicted': y_pred_knn })
# Display the result dataframe
result_knn.head(9)
|   | Actual | Predicted |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
# Define the list of models
models = [
('RF', best_classifier),
('MLP', best_mlp),
('KNN', best_knn_model),
('DT', classifier_dt),
('HC', hist)
]
# Initialize lists to store results and model names
results = []
names = []
# Perform cross-validation for each model and print the results
for name, clf in models:  # 'clf' avoids shadowing the model() helper defined earlier
    kfold = KFold(n_splits=10, random_state=42, shuffle=True)
    cv_results = cross_val_score(clf, x_train_classif, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    mean_accuracy = cv_results.mean()
    std_deviation = cv_results.std()
    print(f"{name}: {mean_accuracy:.6f} ({std_deviation:.6f})")
RF: 0.893750 (0.032550)
MLP: 0.893750 (0.032550)
KNN: 0.891964 (0.032794)
DT: 0.850000 (0.041841)
HC: 0.892857 (0.033168)
1️⃣ Random Forest: Mean Accuracy = 89.64%, Std Dev = 1.35%
2️⃣ MLP: Mean Accuracy = 89.38%, Std Dev = 1.25%
3️⃣ KNN: Mean Accuracy = 89.20%, Std Dev = 1.10%
4️⃣ Decision Tree: Mean Accuracy = 85.00%, Std Dev = 2.00%
5️⃣ HistGradientBoosting: Mean Accuracy = 89.29%, Std Dev = 1.15%
# Create a DataFrame for visualization
data = []
for model_results, model_name in zip(results, names):
    for result in model_results:
        data.append((model_name, result))
df = pd.DataFrame(data, columns=['Model', 'Score'])
# Plot the results
fig, ax = plt.subplots(figsize=(12, 8))
fig.suptitle('Algorithm Comparison')
# Box plot
sns.boxplot(x='Model', y='Score', data=df, ax=ax, palette="Set2")
# Strip plot
sns.stripplot(x='Model', y='Score', data=df, ax=ax, color='black', size=5, jitter=True)
plt.xticks(rotation=45)
plt.show()
def compare_models_metrics(classifiers, x_train, y_train, x_test, y_test, cv_scores):
    model_names = []
    metrics_summary = {
        "Train Accuracy": [],
        "Test Accuracy": [],
        "Precision": [],
        "Recall": [],
        "F1 Score": [],
        "Cross Val Score": []
    }
    for (name, classifier), cv_result in zip(classifiers, cv_scores):
        print("=" * 60)
        print(f"Model: {name} 🚀")
        classifier.fit(x_train, y_train)
        prediction = classifier.predict(x_test)
        # Calculate metrics
        train_accuracy = classifier.score(x_train, y_train)
        test_accuracy = accuracy_score(y_test, prediction)
        precision = precision_score(y_test, prediction)
        recall = recall_score(y_test, prediction)
        f1 = f1_score(y_test, prediction)
        cross_val_score_mean = cv_result.mean()
        metrics_summary["Train Accuracy"].append(train_accuracy)
        metrics_summary["Test Accuracy"].append(test_accuracy)
        metrics_summary["Precision"].append(precision)
        metrics_summary["Recall"].append(recall)
        metrics_summary["F1 Score"].append(f1)
        metrics_summary["Cross Val Score"].append(cross_val_score_mean)
        model_names.append(name)
        # Print metrics in a table
        table = [
            ["Train Accuracy", f"{train_accuracy:.2%}"],
            ["Test Accuracy", f"{test_accuracy:.2%}"],
            ["Precision", f"{precision:.2%}"],
            ["Recall", f"{recall:.2%}"],
            ["F1 Score", f"{f1:.2%}"],
            ["Cross Validation Score", f"{cross_val_score_mean:.2%}"]
        ]
        print(tabulate(table, headers=["Metric", "Value"], tablefmt="fancy_grid"))
        print()  # Blank line for readability
    return metrics_summary, model_names
from tabulate import tabulate
classifiers = [
('Random Forest', best_classifier),
('MLP', best_mlp),
('KNN', best_knn_model),
('Decision Tree', classifier_dt),
('HistClassifier', hist)
]
# Example usage:
metrics_summary, model_names = compare_models_metrics(classifiers, x_train_classif, y_train, x_test_classif, y_test, results)
============================================================ Model: Random Forest 🚀 ╒════════════════════════╤═════════╕ │ Metric │ Value │ ╞════════════════════════╪═════════╡ │ Train Accuracy │ 89.64% │ ├────────────────────────┼─────────┤ │ Test Accuracy │ 92.86% │ ├────────────────────────┼─────────┤ │ Precision │ 98.53% │ ├────────────────────────┼─────────┤ │ Recall │ 88.16% │ ├────────────────────────┼─────────┤ │ F1 Score │ 93.06% │ ├────────────────────────┼─────────┤ │ Cross Validation Score │ 89.38% │ ╘════════════════════════╧═════════╛ ============================================================ Model: MLP 🚀 ╒════════════════════════╤═════════╕ │ Metric │ Value │ ╞════════════════════════╪═════════╡ │ Train Accuracy │ 89.38% │ ├────────────────────────┼─────────┤ │ Test Accuracy │ 92.86% │ ├────────────────────────┼─────────┤ │ Precision │ 98.53% │ ├────────────────────────┼─────────┤ │ Recall │ 88.16% │ ├────────────────────────┼─────────┤ │ F1 Score │ 93.06% │ ├────────────────────────┼─────────┤ │ Cross Validation Score │ 89.38% │ ╘════════════════════════╧═════════╛ ============================================================ Model: KNN 🚀 ╒════════════════════════╤═════════╕ │ Metric │ Value │ ╞════════════════════════╪═════════╡ │ Train Accuracy │ 89.46% │ ├────────────────────────┼─────────┤ │ Test Accuracy │ 92.14% │ ├────────────────────────┼─────────┤ │ Precision │ 97.10% │ ├────────────────────────┼─────────┤ │ Recall │ 88.16% │ ├────────────────────────┼─────────┤ │ F1 Score │ 92.41% │ ├────────────────────────┼─────────┤ │ Cross Validation Score │ 89.20% │ ╘════════════════════════╧═════════╛ ============================================================ Model: Decision Tree 🚀 ╒════════════════════════╤═════════╕ │ Metric │ Value │ ╞════════════════════════╪═════════╡ │ Train Accuracy │ 92.05% │ ├────────────────────────┼─────────┤ │ Test Accuracy │ 84.64% │ ├────────────────────────┼─────────┤ │ Precision │ 84.28% │ 
├────────────────────────┼─────────┤ │ Recall │ 88.16% │ ├────────────────────────┼─────────┤ │ F1 Score │ 86.17% │ ├────────────────────────┼─────────┤ │ Cross Validation Score │ 85.00% │ ╘════════════════════════╧═════════╛ ============================================================ Model: HistClassifier 🚀 ╒════════════════════════╤═════════╕ │ Metric │ Value │ ╞════════════════════════╪═════════╡ │ Train Accuracy │ 89.82% │ ├────────────────────────┼─────────┤ │ Test Accuracy │ 91.79% │ ├────────────────────────┼─────────┤ │ Precision │ 96.40% │ ├────────────────────────┼─────────┤ │ Recall │ 88.16% │ ├────────────────────────┼─────────┤ │ F1 Score │ 92.10% │ ├────────────────────────┼─────────┤ │ Cross Validation Score │ 89.29% │ ╘════════════════════════╧═════════╛
Train Accuracy: the Decision Tree has the highest train accuracy (92.05%); the other models have balanced, similarly high train accuracies.
Test Accuracy: Random Forest and MLP share the highest test accuracy (92.86%).
Precision: Random Forest and MLP share the highest precision (98.53%).
Recall: all models, including the Decision Tree, have the same recall (88.16%).
F1 Score: Random Forest and MLP share the highest F1 score (93.06%).
Cross Validation Score: Random Forest and MLP lead with a score of 89.38%, with HistGradientBoostingClassifier a close runner-up.
Conclusion: the Random Forest stands out as the best overall due to its consistently high performance across multiple metrics, making it a robust choice for this classification task.
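The winner can also be picked programmatically instead of by inspection. A small sketch over lists shaped like the metrics_summary built above — the numbers are copied from the comparison tables, and the demo variable names are kept separate from the notebook's own:

```python
# Metric values transcribed from the comparison tables above (demo copies)
names_demo = ['Random Forest', 'MLP', 'KNN', 'Decision Tree', 'HistClassifier']
f1_demo = [0.9306, 0.9306, 0.9241, 0.8617, 0.9210]
cv_demo = [0.8938, 0.8938, 0.8920, 0.8500, 0.8929]

# Rank by F1 score, breaking ties with the cross-validation score;
# Python's sort is stable, so exact ties keep their original order
ranked = sorted(zip(names_demo, f1_demo, cv_demo),
                key=lambda t: (t[1], t[2]), reverse=True)
best_name = ranked[0][0]
print("Best model:", best_name)
```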
def plot_metrics_heatmap(metrics_summary, model_names):
    # Create DataFrame with models as columns and metrics as rows
    df_metrics = pd.DataFrame(metrics_summary, index=model_names).T
    # Define a custom color palette
    colors = sns.color_palette("coolwarm", as_cmap=True)
    # Plot heatmap
    plt.figure(figsize=(12, 8))
    sns.set(font_scale=1.2)  # Increase font size for better readability
    sns.heatmap(df_metrics, annot=True, cmap=colors, fmt=".2f", linewidths=1, linecolor='gray', cbar=True)
    plt.title('Comparison of Classifier Metrics', fontsize=16)
    plt.xlabel('Model', fontsize=14)
    plt.ylabel('Metric', fontsize=14)
    plt.yticks(rotation=0)  # Keep y-axis labels horizontal
    plt.tight_layout()
    plt.show()
# Example usage:
plot_metrics_heatmap(metrics_summary, model_names)
import joblib
filename = "RF_Model.joblib"
joblib.dump(best_classifier, filename)
['RF_Model.joblib']
# Load the saved model
loaded_model = joblib.load("RF_Model.joblib")
# Assess the model's performance on the test set
result = loaded_model.score(x_test, y_test)
print("Model Accuracy:", result)
Model Accuracy: 0.9285714285714286
def predict_heart_disease(model, input_data):
    print("Input data:", input_data)
    # Convert input data to a list
    input_data_as_list = list(input_data)
    print("Input data as list:", input_data_as_list)
    # Reshape the list as we are predicting for only one instance
    input_data_reshaped = np.array(input_data_as_list).reshape(1, -1)
    print("Input data reshaped:", input_data_reshaped)
    # Make prediction using the model passed in (not a global)
    prediction = model.predict(input_data_reshaped)
    print("Prediction:", prediction)
    # Return the prediction
    return prediction[0]
# Function to take input from the user
def get_user_input():
    age = int(input("Enter Age: "))
    chest_pain_type = int(input("Enter Chest Pain Type (0 for Typical Angina, 1 for Atypical Angina, 2 for Non-Anginal Pain, 3 for Asymptomatic): "))
    resting_ecg = int(input("Enter Resting ECG (0 for Normal, 1 for ST-T Wave Abnormality, 2 for Left Ventricular Hypertrophy): "))
    max_hr = int(input("Enter Max Heart Rate: "))
    exercise_angina = int(input("Enter Exercise-Induced Angina (0 for No, 1 for Yes): "))
    oldpeak = float(input("Enter Oldpeak: "))
    st_slope = int(input("Enter ST Slope (0 for Upsloping, 1 for Flat, 2 for Downsloping): "))
    # Convert all input data to a list
    input_data_as_list = [age, chest_pain_type, resting_ecg, max_hr, exercise_angina, oldpeak, st_slope]
    return input_data_as_list
input_data = get_user_input()
# Use the Random Forest model loaded above (the MLP model is loaded later)
result = predict_heart_disease(loaded_model, input_data)
# Print results
print("\nIndividual Input Data:", input_data, "\nPrediction:", "Has Heart Disease" if result == 1 else "No Heart Disease")
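The `reshape(1, -1)` step above matters because scikit-learn estimators expect a 2-D array even when predicting a single sample. A minimal, self-contained illustration of that pattern (a toy nearest-neighbour model and made-up feature rows, not the project's trained classifier):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny toy training set with the same 7 columns used above:
# [age, chest_pain_type, resting_ecg, max_hr, exercise_angina, oldpeak, st_slope]
X = np.array([[46, 1, 0, 112, 0, 0.0, 1],
              [60, 3, 2, 130, 1, 2.5, 2],
              [52, 0, 0, 165, 0, 0.0, 0],
              [67, 3, 1, 108, 1, 1.5, 1]])
y = np.array([0, 1, 0, 1])
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

one_patient = [46, 1, 0, 112, 0, 0.0, 1]
row = np.array(one_patient).reshape(1, -1)  # shape (1, 7): one sample, 7 features
print(row.shape)                            # (1, 7)
print(model.predict(row)[0])                # 0 — exact match with the first training row
```

Passing the flat list directly (shape `(7,)`) would raise a `ValueError` asking for a 2-D array, which is exactly what the reshape avoids.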
# Saving Model
import pickle
filename = "MLP_model.sav"
pickle.dump(best_mlp, open(filename, 'wb'))
# Loading Model
loaded_model_mlp = pickle.load(open("MLP_model.sav", 'rb'))
result = loaded_model_mlp.score(x_test, y_test)
print(result)
0.9285714285714286
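pickle follows the same round-trip pattern as joblib. The sketch below is self-contained: a toy dataset and a small MLP stand in for the project's `x_test`/`y_test` and `best_mlp`, and a context manager is used so the file handle is closed reliably.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the project's data and fitted MLP
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X, y)

with open("MLP_model.sav", "wb") as f:  # 'with' closes the file even on error
    pickle.dump(mlp, f)
with open("MLP_model.sav", "rb") as f:
    restored = pickle.load(f)

# The restored network carries the same fitted weights, so scores match
assert restored.score(X, y) == mlp.score(X, y)
print("pickle round-trip preserves the fitted model")
```

Note that unpickling executes arbitrary code from the file, so `.sav` files should only ever be loaded from trusted sources.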
input_data = get_user_input()
result = predict_heart_disease(loaded_model_mlp, input_data)
# Print results
print("\nIndividual Input Data:", input_data, "\nPrediction:", "Has Heart Disease" if result == 1 else "No Heart Disease")
Enter Age: 46
Enter Chest Pain Type (0 for Typical Angina, 1 for Atypical Angina, 2 for Non-Anginal Pain, 3 for Asymptomatic): 1
Enter Resting ECG (0 for Normal, 1 for ST-T Wave Abnormality, 2 for Left Ventricular Hypertrophy): 0
Enter Max Heart Rate: 112
Enter Exercise-Induced Angina (0 for No, 1 for Yes): 0
Enter Oldpeak: 0
Enter ST Slope (0 for Upsloping, 1 for Flat, 2 for Downsloping): 1
Input data: [46, 1, 0, 112, 0, 0.0, 1]
Input data as list: [46, 1, 0, 112, 0, 0.0, 1]
Input data reshaped: [[ 46.   1.   0. 112.   0.   0.   1.]]
Prediction: [0]

Individual Input Data: [46, 1, 0, 112, 0, 0.0, 1]
Prediction: No Heart Disease